Advanced Growth Modeling with pcvr

DDPSC Data Science Core, August 2023

Josh Sumner

Outline

  • pcvr Goals
  • Load Package
  • Why Longitudinal Modeling?
  • Why Bayesian Modeling?
  • Supported Curves
  • growthSS
  • fitGrowth
  • Model Visualization
  • Hypothesis testing
  • Using brms directly
  • Resources

pcvr Goals

Currently pcvr aims to:

  • Make common tasks easier and consistent
  • Make select Bayesian statistics easier

There is room for goals to evolve based on feedback and scientific needs.

Load package

Pre-work was to install R, Rstudio, and pcvr with dependencies.

library(pcvr) # or devtools::load_all() if you are editing
library(brms) # for bayesian models
library(data.table) # for fread
library(ggplot2) # for plotting
library(patchwork) # to arrange ggplots

Why Longitudinal Modeling?

plantCV allows for user friendly high throughput image based phenotyping

Resulting data follows individuals over time, which changes our statistical needs.

Why Longitudinal Modeling? 2

Longitudinal Data is:

  • Autocorrelated
  • Often non-linear
  • Heteroskedastic

Why Longitudinal Modeling? 3

Why Longitudinal Modeling? 4

Why Longitudinal Modeling? 5

Why Bayesian Modeling?

Bayesian modeling allows us to account for all these problems via a more flexible interface than frequentist methods.

Bayesian modeling also allows for non-linear, probability driven hypothesis testing.

Why Bayesian Modeling? 2

In a Bayesian context we flip “random” and “fixed” elements.

Fixed Random Interpretation
Frequentist True Effect Data If the True Effect is 0 then there is an \(\alpha\cdot100\)% chance of estimating an effect of this size or more.
Bayesian Data True Effect Given the estimated effect from our data there is a P probability of the True Effect being a difference of at least X

Supported Growth Models

There are 6 main growth models supported in pcvr.

3 are asymptotic, 3 are non-asympototic.

Supported Growth Models

Supported Growth Models 2

There are also two double sigmoid curves intended for use with recovery experiments.

growthSS

Any of the models shown above can be specified easily using growthSS.

growthSS - form

With a model specified a rough formula is required to parse your data to fit the model.

The layout of that formula is:

outcome ~ time|individual/group

growthSS - form 2

Here we would use y~time|id/group

simdf<-growthSim("gompertz", n=20, t=25,
                 params = list("A"=c(200,160),
                               "B"=c(13, 11),
                               "C"=c(0.2, 0.25)))
head(simdf)
    id group time           y
1 id_1     a    1 0.002499967
2 id_1     a    2 0.015541085
3 id_1     a    3 0.071661525
4 id_1     a    4 0.257371721
5 id_1     a    5 0.749980499
6 id_1     a    6 1.834833412

growthSS - form 3

Generally it makes sense to visually check that your formula covers your experimental design.

Note that it is fine for id to be duplicated between groups, but not within groups

growthSS - sigma

Recall the heteroskedasticity problem, shown again with our simulated data:

growthSS - sigma

There are lots of ways to model a trend like that we see for sigma.

pcvr offers four options through growthSS.

growthSS - Intercept sigma

Pros Cons
Faster model fitting Very inaccurate intervals at early timepoints

growthSS - Linear sigma

Pros Cons
Models still fit quickly Variance tends to increase non-linearly
Easy testing on variance model

growthSS - Linear sigma

growthSS - Gompertz sigma

Pros Cons
Models fit much faster than splines Slightly slower than linear sub-models
Variance is often asymptotic Requires priors on sigma model
Easy testing on variance model

growthSS - Gompertz sigma

growthSS - Spline sigma

Pros Cons
Very flexible and accurate model for sigma Significantly slower than other options
Fewer priors Splines can be a black-box

growthSS - Spline sigma

growthSS - other sigma models

You can always add a new sigma formula if something else fits your needs better.

growthSS - priors

Bayesian statistics combine prior distributions and collected data to form a posterior distribution.

Luckily, in the growth model context it is pretty easy to set “good priors”.

growthSS - priors

“Good priors” are generally mildly informative, but not very strong.

They provide some well vetted evidence, but do not overpower the data.

growthSS - priors

For our setting we know growth is positive and we should have basic impressions of what sizes are possible.

At the “weakest” side of these priors we at least know growth is positive and the camera only can measure some finite space.

growthSS - priors 2

Default priors in growthSS are log-normal

\(\text{log}~N(\mu, 0.25)\)

This has the benefit of giving a long right tail and strictly positive values while only requiring us to provide \(\mu\).

growthSS - priors 3

We can see what those log-normal distributions look like with plotPrior.

priors = list("A" = 130, "B" = 10, "C" = 0.2)
priorPlots<-plotPrior(priors)
priorPlots[[1]]/priorPlots[[2]]/priorPlots[[3]]

growthSS - priors 3

growthSS - priors 4

Those distributions can still be somewhat abstract, so we can simulate draws from the priors and see what those values yield in our growth model.

twoPriors = list("A" = c(100, 130), "B" = c(6, 12), "C" = c(0.5, 0.25))
plotPrior(twoPriors, "gompertz",n=100)[[1]]

growthSS - priors 4

growthSS - priors 5

Our final call to growthSS will look like this for our sample data.

ss<-growthSS(model="gompertz", form =  y~time|id/group, sigma="gompertz", df=simdf,
             priors = list("A" = 130, "B" = 10, "C" = 0.5, "subA"=20, "subB"=10, "subC"=0.25))

fitGrowth

Now that we have the components for our model from growthSS we can fit the model with fitGrowth.

fitGrowth 2

This will call Stan outside of R to run Markov Chain Monte Carlo (MCMC) to get draws from the posterior distributions. We can control how Stan runs with additional arguments to fitGrowth, although the only required argument is the output from growthSS.

fitGrowth 2

Here we specify our ss argument to be the output from growthSS and tell the model to use 4 cores so that the chains run entirely in parallel, but the rest of this model is using defaults.

fit <- fitGrowth(
  ss = ss, cores = 4, 
  iter = 2000, chains = 4, backend = "cmdstanr")

fitGrowth 2

Note that there are lots of arguments that can be passed to brms::brm via fitGrowth.

One that can be very helpful for fitting complex models is the control argument, where we can control the sampler’s behavior.

fitGrowth 2

adapt_delta and tree_depth are both used to reduce the number of “divergent transitions” which are times that the sampler has some departure from the True path and which can compromise the results.

fit <- fitGrowth(ss, cores = 4, 
                 iter = 2000, chains = 4, backend = "cmdstanr",
                 control = list(adapt_delta = 0.999, max_treedepth = 20) )

fitGrowth 3

fitGrowth returns a brmsfit object, see ?brmsfit and methods(class="brmsfit") for general information.

Within pcvr there are several functions for visualizing these objects.

Model Visualization

brmPlot can be used to plot credible intervals of your model.

brmPlot(fit, form=ss$pcvrForm, df=ss$df)

Model Visualization 2

These plots can show one of the benefits of an asymptotic sub model well.

Here we check our model predictions to 35 days.

brmPlot(fit, form=ss$pcvrForm, df=ss$df, timeRange=1:35)

Model Visualization 2

And now we check those predictions from a spline model, where the basis functions were not expecting to have data past day 25.

brmPlot(fit_spline, form=ss_spline$pcvrForm, df=ss_spline$df, timeRange=1:35)

Model Visualization 3

We can also plot the posterior distributions and test hypotheses with brmViolin.

Here hypotheses are tested with brms::hypothesis.

brmViolin(fit, hyp="num/denom>1.05",
          compareX = "a", againstY = "b", x = "group")

Hypothesis Testing

brms::hypothesis allows for incredibly flexible hypothesis testing.

Here we test for an asymptote for group A at least 20% larger than that of group B.

brms::hypothesis(fit, "A_groupa > 1.2 * A_groupb")$hyp
                     Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (A_groupa)-(1.2*A_groupb) > 0 6.290106  4.925797 -1.74309 14.41414    8.90099
  Post.Prob Star
1     0.899     

Hypothesis Testing 2

Here we test the model of our bellwether data to see if the ratio of two genotypes is 10% larger when fertilizer is added.

                                                           Hypothesis  Estimate
1 ((A_group0.B73/A_group0.MM))-(1.1*(A_group50.B73/A_group50.MM)) > 0 0.2412093
  Est.Error   CI.Lower  CI.Upper Evid.Ratio Post.Prob Star
1 0.3224859 -0.2874997 0.7497439   3.728132    0.7885     

Using brms directly

These functions are all to help use common growth models more easily.

The choices in pcvr are a tiny subset of what is possible with brms, which itself is more limited than Stan.

Using brms directly

Our gompertz sigma model looks like this in brms:

prior1 <- prior(gamma(2,0.1), class="nu", lb=0.001)+
  prior(lognormal(log(130), .25),nlpar = "A", lb = 0) +
  prior(lognormal(log(12), .25), nlpar = "B", lb = 0) + 
  prior(lognormal(log(1.2), .25), nlpar = "C", lb = 0)+
  prior(lognormal(log(25), .25),nlpar = "subA", lb = 0) +
  prior(lognormal(log(20), .25), nlpar = "subB", lb = 0) + 
  prior(lognormal(log(1.2), .25), nlpar = "subC", lb = 0)

form_b <- bf(y ~ A*exp(-B*exp(-C*time)), 
           nlf(sigma ~ subA*exp(-subB*exp(-subC*time))),
           A+B+C+subA+subB+subC ~ 0+group, 
           autocor = ~arma(~time|sample:group,1,1),
           nl = TRUE ) 

fit_g2 <- brm(form_b, family = student, prior = prior1, data = simdf,
             iter = 1000, cores = 4, chains = 4, backend = "cmdstanr", silent = 0,
             control = list(adapt_delta = 0.999,max_treedepth = 20),
             init = 0 ) # chain initialization at 0 for simplicity

Using brms directly

It can be more work to try new options in brms or Stan, but if you have a situation not well represented by the existing models then it may be necessary.

Resources

If you run into a novel situation please reach out and we will try to come up with a solution and add it to pcvr if possible.

Good ways to reach out are the help-datascience slack channel and pcvr github repository.